Effect of size heterogeneity on community identification in complex networks

Identifying community structure can be a potent tool in the analysis and understanding of the
structure of complex networks. Up to now, methods for evaluating the performance of identification 
algorithms use ad-hoc networks with communities of equal size. We show that inhomogeneities 
in community sizes can and do affect the performance of algorithms considerably, and propose 
an alternative method which takes these factors into account. Furthermore, we propose a simple 
modification of the algorithm proposed by Newman for community detection (Phys. Rev. E 69 
066133) which treats communities of different sizes on an equal footing, and show that it outperforms 
the original algorithm while retaining its speed.

I. INTRODUCTION

Natural and artificial systems often have architectures
which are best described as complex networks. The
topologies of networks have been extensively studied in
various disciplines in recent years, particularly within
physics. Part of that research has been directed at the
study of modules or communities in networks. Communities
can be defined as subsets of nodes which are densely
connected to each other and loosely connected to the rest
of the network. Such structures have been discovered in
networks as diverse as banking networks, metabolic
networks, the airport network and, most notably, social
networks.

Despite efforts spanning several decades in this direction,
the identification of community structure in networks
remains an open problem. The space of possible partitions
of even a small network is very large. Several methods have
been proposed for finding meaningful partitions in networks
of reasonable size. These methods vary considerably from
one another, not only in their general approach, but also
in sensitivity and computational effort. In general, the
more accurate methods are able to explore a larger portion
of the partition space, and are therefore computationally
expensive. On the other hand, methods which explore a
smaller region of the partition space tend to be faster
but, as a consequence, less accurate. The challenge,
therefore, is to find methods which are both fast and
accurate, and several attempts have been made.

In this paper we reevaluate the benchmark most commonly
used at present to measure the sensitivity of a particular
community identification algorithm [22]. This benchmark,
although useful, ignores the fact that community sizes in
real networks are highly skewed, even though several
authors have observed that distributions of community
sizes seem to follow power laws in many cases. In the next
section we propose a benchmark for measuring algorithm
sensitivity which takes this skew into account. In Section
III we examine Newman's Fast algorithm (NF) for community
detection [17], and see that it is affected by a skew in
the community size distribution, showing a tendency to
find large communities at the expense of smaller ones. We
propose a modification of the algorithm, in which
communities of different sizes are treated equally, and in
Section IV we show that it outperforms the NF algorithm in
sensitivity, with no tradeoff in terms of computational
effort.



II. EVALUATING ALGORITHM 
PERFORMANCE ON AD-HOC NETWORKS 

To quantify how good a particular network partition is,
the modularity measure $Q$ was introduced in [22], and has
been widely used since then. Based on a predefined set of
communities in a network, a community connection matrix
$e_{ij}$ is defined, where each element represents the
proportion of links from community $i$ to community $j$.
Note that the matrix is normalised, that is, each element
is $e_{ij} = L_{ij}/L_{total}$, where $L_{ij}$ is the
number of links between community $i$ and community $j$,
and $L_{total}$ is the total number of links in the network
[22]. The proportion of links belonging to community $i$
is denoted $a_i$ and is simply the row sum,
$a_i = \sum_j e_{ij}$. The computation of $Q$ is as follows:

\[
Q = \sum_i \left( e_{ii} - a_i^2 \right) \tag{1}
\]

The modularity, Q, quantifies the difference between 
the intra-community links and the expected value for the 
same communities in a randomised network. Note that 
the modularity is a relative value, and while it gives an 
idea of how good a partition of the network is, it cannot 
tell us whether this partition is the best one possible. It
does provide a useful way of comparing the performance
of different community identification algorithms applied
to one particular network.
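As an illustration of Eq. (1), the following minimal sketch computes $Q$ for a given partition of an undirected, unweighted network. The function name and the input format (an edge list with each link listed once, and a node-to-community dictionary) are assumptions made for this example. Each link contributes half of its weight to each of its two ends, so that all the $a_i$ sum to one.

```python
from collections import defaultdict


def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list.

    edges: list of (u, v) pairs, each link listed once (assumed format).
    community: dict mapping node -> community label.
    """
    m = len(edges)
    e_in = defaultdict(float)  # e_ii: fraction of links inside community i
    a = defaultdict(float)     # a_i: fraction of link ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] += 1.0 / m
        a[cu] += 0.5 / m
        a[cv] += 0.5 / m
    return sum(e_in[c] - a[c] ** 2 for c in a)


# Two triangles joined by a single link, split into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
print(modularity(edges, partition))
```

For this toy network each triangle holds 3/7 of the links and half of the link ends, so $Q = 2(3/7 - 1/4) = 5/14$.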

The method most commonly used to compare the sensitivity
of community identification methods was also proposed in
[22], and is independent of the modularity measure. It
uses a benchmark test based on networks typically
containing 128 nodes grouped into four communities of 32
nodes each, with an average of 16 links per node
($k = 16$). Pairs of nodes belonging to the same community
are linked with probability $p_{in}$, whereas pairs
belonging to different communities are joined with
probability $p_{out}$. The value of $p_{out}$ controls the
average number of links a node has to members of any other
community, $z_{out}$. While $p_{out}$ (and therefore
$z_{out}$) is varied freely, the value of $p_{in}$ is
chosen to keep the total average node degree $k$ constant.
As $z_{out}$ is increased from zero, the communities become
more and more fuzzy and harder to identify. Different
community detection algorithms, when applied to these
networks, may give different results, reflecting their
sensitivity. Since the 'real' community structure is known
in this case, it is possible to measure how well the
partitions the algorithm finds compare to the original
partitions.
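A minimal sketch of such a benchmark generator is given below. The text does not spell out how the probabilities are obtained from $k$ and $z_{out}$; the sketch assumes $p_{out} = z_{out}/(N - n)$ and $p_{in} = (k - z_{out})/(n - 1)$, where $n$ is the community size and $N$ the network size, which gives the required average degrees. The function name and its parameters are illustrative.

```python
import random


def equal_size_benchmark(n_comm=4, size=32, k=16.0, z_out=6.0, seed=0):
    """Random network with n_comm planted communities of equal size."""
    rng = random.Random(seed)
    n_nodes = n_comm * size
    # Assumed relations between link probabilities and average degrees.
    p_in = (k - z_out) / (size - 1)    # k - z_out expected internal links per node
    p_out = z_out / (n_nodes - size)   # z_out expected external links per node
    community = {v: v // size for v in range(n_nodes)}
    edges = []
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community, p_in, p_out


edges, community, p_in, p_out = equal_size_benchmark()
print(p_in, p_out, 2 * len(edges) / len(community))
```

The last printed value is the realised average degree, which fluctuates around $k = 16$ from one realisation to the next.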

Here we use a measure based on information theory for
this purpose. The normalised mutual information, $I(A,B)$,
explicitly measures the amount of information about
partition $A$ that is gained by knowing partition $B$
[27]. In other words, it is the amount of information
about the pre-defined partition that the algorithm is able
to extract just from the topology. This independent
measure is based on defining a confusion matrix $M$, where
rows correspond to "real" communities, and columns
correspond to "found" communities. The element $M_{ij}$ of
$M$ is the number of nodes in the real community $i$ that
appear in the found community $j$. A measure of similarity
between the partitions is then:

\[
I(A,B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} M_{ij} \log\left( \frac{M_{ij} N}{M_{i.} M_{.j}} \right)}
{\sum_{i=1}^{c_A} M_{i.} \log\left( \frac{M_{i.}}{N} \right) + \sum_{j=1}^{c_B} M_{.j} \log\left( \frac{M_{.j}}{N} \right)} \tag{2}
\]

where the number of real communities is denoted $c_A$ and
the number of found communities is denoted $c_B$, $N$ is
the number of nodes, the sum over row $i$ of matrix
$M_{ij}$ is denoted $M_{i.}$, and the sum over column $j$
is denoted $M_{.j}$.
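The normalised mutual information can be computed directly from two lists of community labels. The sketch below builds the confusion matrix and its row and column sums with counters; the function name and the label-list input format are assumptions for this example, as is the convention of returning 1 when both partitions consist of a single community (the ratio is then 0/0).

```python
import math
from collections import Counter


def normalised_mutual_information(real, found):
    """I(A, B) from two label lists given in the same node order."""
    n = len(real)
    m_ij = Counter(zip(real, found))  # confusion matrix entries M_ij (non-zero only)
    m_i = Counter(real)               # row sums M_i.
    m_j = Counter(found)              # column sums M_.j
    numerator = -2.0 * sum(
        c * math.log(c * n / (m_i[i] * m_j[j])) for (i, j), c in m_ij.items()
    )
    denominator = sum(c * math.log(c / n) for c in m_i.values()) + sum(
        c * math.log(c / n) for c in m_j.values()
    )
    if denominator == 0.0:
        # Both partitions are one single community (assumed convention).
        return 1.0
    return numerator / denominator


print(normalised_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabelled
print(normalised_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # unrelated partitions
```

Identical partitions give a value of 1 regardless of how the communities are labelled, while partitions that carry no information about each other give 0.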

Because of the particular definition of these ad-hoc
networks, it is tempting to think that similar networks
with four communities sharing the same value of
$z_{out}/k$ will have an equivalent community structure,
and that a particular method of community identification
will perform equally well on them. This, however, is
highly dependent on the number of nodes in the network
and, more importantly, the number of nodes in each
community. For example, a network with 128 nodes with four
communities each of size 32 with $k = 16$ and
$z_{out} = 6$, say, will have a better defined community
structure than a network with the same values of $k$ and
$z_{out}$ which is comprised of 512 nodes with four
communities each of size 128. This is simply due to the
fact that the internal links are spread out over a larger
number of nodes, thus making the communities less dense in
terms of the proportion of actual links to possible links.
In Figure 1 we can see that the same algorithm performs
significantly better on a network with 128 nodes than on
one with 512 nodes with the same values of $k$ and
$z_{out}$.

FIG. 1: (Colour online) Sensitivity of the NF algorithm and
the modification described in Section III applied to ad-hoc
networks with four equal-sized communities, for two network
sizes, 128 nodes and 512 nodes, with average degree $k = 16$.
The top figure shows the variation of modularity found by the
algorithms with $z_{out}/k$. For low values of $z_{out}/k$, the
value of $Q$ of the partitions found closely follows the
expected modularity. For higher values of $z_{out}/k$, the
partitions found show a better modularity than the pre-defined
partitions. There is little difference between results for
different network sizes. In the bottom figure the comparison
between pre-defined and found partitions using the mutual
information measure $I(A,B)$ is shown. Both algorithms have
similar sensitivity for both network sizes, but the
sensitivity is reduced at the same value of $z_{out}/k$ for
the larger network, suggesting that communities are more
fuzzy the larger they are, as discussed in the text.
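The comparison between the 128-node and the 512-node network can be made concrete by computing the fraction of possible internal links that are actually present. The sketch below assumes that each node has on average $k - z_{out}$ links to the other $n - 1$ members of its community; the function name is illustrative.

```python
def internal_density(community_size, k=16.0, z_out=6.0):
    """Expected fraction of possible intra-community links that are present."""
    return (k - z_out) / (community_size - 1)


small = internal_density(32)   # four communities of 32 nodes (128 nodes in total)
large = internal_density(128)  # four communities of 128 nodes (512 nodes in total)
print(f"density, 128-node network: {small:.3f}")
print(f"density, 512-node network: {large:.3f}")
```

Under these assumptions the communities of the larger network are roughly four times sparser, even though $z_{out}/k$ is the same in both cases.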

Furthermore, in real networks the distribution of
community sizes is highly skewed, and has been observed to
follow power laws in many cases [13, 23, 25]. We argue
that this difference in sizes is important and affects
different identification algorithms in different ways. To
be able to evaluate the effect that a spread in community
sizes will have on the performance of any algorithm, we
first need to be able to create networks with controlled
community structure of differing community sizes.

Consider a set of $N_c$ communities where community $i$
contains $n_i$ nodes. Considering pairs of nodes, if both
nodes are in the same community a link is placed between
them with probability $P_{in}$, otherwise they are
connected with probability $P_e$. Should $P_{in}$ be
constant for all communities, the number of links of
community $i$ would scale as the square of its size,
$n_i^2$. To give the same weight to communities of
different sizes, we propose that $P_{in} = F/n_i$, where
$F$ is a control parameter. In this way we are able to
control both internal and external cohesion by varying $F$
and $P_e$ respectively. This method of network creation is
equivalent to creating a random Erdős-Rényi network with
the probability of linking being equal to $P_e$ and then
superposing $N_c$ random networks whose sizes correspond
to $n_i$, where the probability of internal linking is
$F/n_i$.

Figure 2 (a and b) shows two networks with 5 communities
each, containing one community of 64 nodes and 4
communities of 16 nodes each, for two different values of
$P_e$ and $F$. Figure 2(c) shows the value of $Q$ when the
network partition corresponds exactly to the prescribed
communities as a function of $F$ and $P_e$. While these
community sizes are chosen to be illustrative, this method
of network creation is completely general and community
sizes can be drawn from any given distribution.

FIG. 2: (Colour online) Two example networks created as
described in the main text with 5 communities, four of which
have 16 nodes and one of which has 64. (a) has $P_e = 0.007$
and $F = 8$, and (b) has $P_e = 0.03$ and $F = 3$. (c) The
modularity $Q$ of networks generated as described in the main
text for values of $P_e$ between 0.001 and 0.03, and values
of $F$ between 1 and 14. The dark zones represent parts of
the parameter space where the networks constructed were
disconnected for more than 1 in 100 realisations.
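The construction can be sketched in a few lines. The version below applies the pairwise rule directly (probability $F/n_i$ inside community $i$, probability $P_e$ between communities) instead of superposing separate random networks. The function name is illustrative, and capping $F/n_i$ at 1 is an assumption added so that the sketch stays valid for small communities.

```python
import random


def heterogeneous_benchmark(sizes, F, P_e, seed=0):
    """Random network with planted communities of the given sizes."""
    rng = random.Random(seed)
    community = {}
    node = 0
    for label, n_i in enumerate(sizes):
        for _ in range(n_i):
            community[node] = label
            node += 1
    n_nodes = node
    edges = []
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            if community[u] == community[v]:
                # P_in = F / n_i, capped at 1 (assumption for small n_i).
                p = min(1.0, F / sizes[community[u]])
            else:
                p = P_e
            if rng.random() < p:
                edges.append((u, v))
    return edges, community


# Sizes and parameters of the example network in Figure 2(a).
edges, community = heterogeneous_benchmark([64, 16, 16, 16, 16], F=8, P_e=0.007, seed=1)
internal_large = sum(1 for u, v in edges if community[u] == community[v] == 0)
internal_small = sum(1 for u, v in edges if community[u] == community[v] == 1)
print(internal_large, internal_small)
```

With $P_{in} = F/n_i$ the expected number of internal links of community $i$ is $F(n_i - 1)/2$, so it grows linearly with community size and not quadratically: about 252 for the community of 64 nodes and 60 for a community of 16 nodes in this example.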



III. DYNAMICS OF THE FAST ALGORITHM 
AND ITS MODIFICATION 

The performance of various community identification
algorithms has recently been studied both in terms of
speed and in terms of accuracy. Having a method of
generating networks with communities of differing sizes
puts us in a position to test the way these sizes can
affect the performance of identification algorithms. In
particular we concentrate on Newman's Fast algorithm as
proposed in [17]. It is dubbed fast since it runs in
almost linear time for sparse networks,
$O(n \log^2 n)$ [18], and while it is not the most
accurate method, it remains the only algorithm able to
extract community structure information from very large
networks [14].

Let us consider a network that has been partitioned in
some arbitrary way. Joining two neighbouring communities
$i$ and $j$ would produce a change in modularity:

\[
dQ_{ij} = 2 \left( e_{ij} - a_i a_j \right) \tag{3}
\]

This can be interpreted as a measure of affinity of
communities $i$ and $j$, and can subsequently be used to
find the two communities which are most alike (highest
$dQ$). Starting with each node in the network in its own
community, one joins the pair of communities with the
highest $dQ$. This process is repeated until the whole
network is contained in one community. As the author
states in [17], this is very similar to agglomerative
hierarchical clustering methods [29, 30]. Here, "distance"
measures such as single linkage or complete linkage are
replaced by $dQ$. It also differs from hierarchical
clustering in that not all pairs of clusters are compared,
only those connected by real links in the network.
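The agglomeration procedure can be sketched as follows. This is a deliberately naive version that rescans every connected pair of communities at each step, so it illustrates the merge rule but does not achieve the near-linear running time quoted above. It assumes integer node labels, no self-loops and each link listed once; the function name is illustrative.

```python
from collections import defaultdict


def newman_fast(edges):
    """Greedy agglomeration by largest dQ; returns (best_Q, best_partition)."""
    m = len(edges)
    e = defaultdict(lambda: defaultdict(float))  # e[i][j] for connected i != j
    a = defaultdict(float)                       # a[i]
    members = {}
    for u, v in edges:
        e[u][v] += 0.5 / m
        e[v][u] += 0.5 / m
        a[u] += 0.5 / m
        a[v] += 0.5 / m
        members.setdefault(u, {u})
        members.setdefault(v, {v})
    q = -sum(x * x for x in a.values())  # every node alone, so e_ii = 0
    best_q = q
    best = {n: c for c, nodes in members.items() for n in nodes}
    while len(members) > 1:
        # Only pairs joined by at least one real link are compared.
        candidates = [
            (2.0 * (e_ij - a[i] * a[j]), i, j)
            for i in e
            for j, e_ij in e[i].items()
            if i < j
        ]
        if not candidates:
            break  # the remaining communities are not connected to each other
        dq, i, j = max(candidates)
        # Merge community j into community i.
        for nb, e_jnb in e.pop(j).items():
            del e[nb][j]
            if nb != i:
                e[i][nb] += e_jnb
                e[nb][i] += e_jnb
        a[i] += a.pop(j)
        members[i] |= members.pop(j)
        q += dq
        if q > best_q:
            best_q = q
            best = {n: c for c, nodes in members.items() for n in nodes}
    return best_q, best


# Two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best_q, best = newman_fast(edges)
print(best_q, best)
```

On this toy network the partition of highest modularity found along the way is the split into the two triangles, with $Q = 5/14$.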

Let us analyse carefully how the algorithm proceeds when
applied to the well-studied karate club friendship network
of Zachary [31]. Data on the network was collected over a
two year period before the club split due to an internal
dispute during which some of the members started their own
club. The fissure is apparent in the topology of the
network before the split (see Figure 3(a)), and this data
set has become somewhat of a standard case study for
community detection algorithms in the literature.

Figure 3(c) shows the dendrogram as generated by the fast
algorithm, with the different colours depicting the
partition at the highest value of $Q = 0.3807$. In the
first step of the algorithm, $a_i$ is simply the degree of
node $i$ and $e_{ij}$ is 1 for any neighbour pair (up to
normalisation). Hence, the pair of nodes that will be
joined first is the neighbour pair that has the smallest
product of degrees. In the case of the karate club
network, these are nodes 6 and 17 with degrees 3 and 2
respectively. Note that once a community has joined with
another, the resulting community tends to join again,
since the first term of $dQ$, $e_{ij}$, tends to be
increased by the joining of neighbouring communities,
especially in networks with high clustering. So, the
cluster of nodes 6 and 17 absorbs their common neighbour,
node 7. This larger cluster now has an even larger
$e_{ij}$ to common neighbours and in the following steps
absorbs nodes 1, 5 and 11, until no common neighbours
exist. This process occurs in a similar fashion for nodes
24, 27, 28, 30 and 34. We observe that when choosing the
pair of communities to be joined, large communities are
favoured at the expense of smaller ones. In turn, this
leads to the formation of a few large clusters in networks
where a larger number of smaller clusters may represent
the real community structure better.




To avoid this and to treat each community as equal, we
normalise $dQ$ by the number of links:

\[
dQ'_{ij} = \frac{dQ_{ij}}{a_i} = \frac{2}{a_i} \left( e_{ij} - a_i a_j \right) \tag{4}
\]

It is important to note that while the pair of communities
with the largest value of $dQ'$ is chosen, the real value
of $Q$ must be calculated at each step using the original
$dQ$, or by measuring the value of $Q$ explicitly. Note
that, as opposed to the original formulation, this measure
is asymmetric, that is, $dQ'_{ij} \neq dQ'_{ji}$. But the
implementation of the algorithm ensures that both
$dQ'_{ij}$ and $dQ'_{ji}$ are considered when choosing the
pair of communities to join and, since we are interested
in only the largest value of $dQ'$ at each step, this
poses no problem. In essence, the modified algorithm is
able to take a different path in the partition space from
the original, in part due to this asymmetry. For each
possible merging of neighbouring communities, there exists
only one value of $dQ$, whereas $dQ'$ takes two different
values if the two communities have a different number of
links, $a_i \neq a_j$.

This normalisation ensures that clusters with fewer links
have the largest values of $dQ'$, and therefore are joined
earlier. Taking the karate club network as an example
again, we see that neighbouring nodes where one neighbour
has the smallest degree are joined first. This ensures
that nodes with only one link are joined at the beginning
of the process, such as node 12 (see Fig. 3(d)).
Curiously, another method based on synchronisation
recently proposed by two of us produces a very similar
dendrogram. We argue that this is a better way to proceed.
A partition containing a single node will always
contribute negatively to the value of $Q$, even if the
degree of that node is 1. For example, the authors of a
study using an entirely different method for exploring the
partition space find a partition with $Q = 0.412$ which
has node 12 as a separate community. But
$Q_{i=12} = -1/78$ and the same partition, only with node
12 contained within its neighbour's community, has
$Q = 0.418$ [42].

While the NF algorithm also ensures that single node
partitions are not found in the optimal state, our
modification performs this absorption much earlier. This
means that in the first few steps our algorithm will
inevitably appear to perform worse than the NF algorithm.
As it progresses, however, it overtakes the NF algorithm
in terms of $Q$, as we can see in Figure 3(b). Indeed, we
find that when our modification does not match the
performance of the NF algorithm in terms of $Q$, it
improves on it.

FIG. 3: (Colour online) (a) Zachary's karate club network.
(b) Modularity as the algorithms progress. (c) Dendrogram
representing the progress of the fast algorithm, where the
formation of large clusters is favoured early. (d) Dendrogram
representing the progress of our modification, where all
clusters are treated on an equal footing and individual nodes
are absorbed into clusters early.


IV. TESTING THE MODIFICATION 

To test the performance of the modification proposed, we
have applied the algorithm to several networks, both
ad-hoc and real. To begin with we look at networks with
four equal-sized communities, as described in Section II.

As $z_{out}/k$ increases, the modularity of the
pre-defined partition decreases as
$Q = 3/4 - z_{out}/k$, irrespective of network or
community size. Figure 1 (top) shows the expected
modularity value compared with those found by the NF
algorithm and our modification. For low values of
$z_{out}/k$ both algorithms find communities with the
value of $Q$ following the expected value closely. For
higher $z_{out}/k$ these values deviate from the expected
value as the communities found by the algorithms do not
correspond exactly to the pre-defined communities. In
fact, as $z_{out}/k$ increases above 0.5 the pre-defined
partitions give a lower value of $Q$ than those found by
the algorithms, which tend towards the value that random
networks exhibit due to fluctuations. The values of $Q$
found by our modification are very similar to those found
by the NF algorithm.
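The linear dependence of the expected modularity on $z_{out}/k$ follows directly from Eq. (1). A short derivation, under the simplifying assumption that every node has exactly the average degree $k$ and exactly $z_{out}$ links to other communities:

```latex
\begin{align}
  % Four equal communities: each holds a quarter of all link ends.
  a_i &= \tfrac{1}{4}, \qquad i = 1, \dots, 4 \\
  % A fraction z_out/k of all links runs between communities; the rest are internal.
  \sum_{i=1}^{4} e_{ii} &= 1 - \frac{z_{out}}{k} \\
  % Substituting into Q = \sum_i (e_{ii} - a_i^2):
  Q &= \left( 1 - \frac{z_{out}}{k} \right) - 4 \left( \tfrac{1}{4} \right)^2
     = \frac{3}{4} - \frac{z_{out}}{k}
\end{align}
```

Neither the number of nodes nor the community size enters the result, which is why the expected curve is the same for the 128-node and the 512-node networks.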

The deviation between pre-defined and found partitions is
seen more clearly by looking at the mutual information
measure $I(A,B)$ in the lower part of Figure 1. As
$z_{out}/k$ increases beyond the point where communities
are well defined, the amount of information about
community structure the algorithms are able to extract
decreases. When the communities found have hardly any
relation to the pre-defined ones, as is the case for high
$z_{out}/k$, $I(A,B)$ tends to zero. As network size
increases, however, the algorithms are able to extract
less information from the network structure. This supports
the suggestion that communities in these networks become
more fuzzy as their size increases. Once again, our
modification performs very similarly to the NF algorithm.

It seems logical that both algorithms perform with
similar accuracy for these networks. As we have seen in
Section III, the NF algorithm seems to favour the
formation of larger communities. However, when the
communities to be found are all of the same size, one
would expect it to perform quite well. Our modification
has little effect in this case.

The difference between the algorithms appears when
communities of different sizes are present within the
network. Using the network construction method proposed in
Section II, we study the performance of the algorithms on
networks with 21 communities. The community sizes are
chosen by hand, with one community of 128 nodes, four
communities of 32 nodes each and 16 communities of 8 nodes
each. This corresponds to a size distribution which
follows a power law (with only three points) with exponent
-1. In Figure 4 we show the difference in performance
between the NF algorithm and our modification. They are
compared both in terms of modularity and mutual
information. Our modification performs better in all parts
of the parameter space, with some regions showing up to
25% improvement over the original algorithm. The regions
where the improvement is largest are those where the
communities are fuzzy, that is, for high values of
external cohesion $P_e$ and low values of internal
cohesion $F$.
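The claim that these three sizes follow a power law with exponent -1 can be checked with a few lines of arithmetic. The sketch below computes the slope of the number of communities against community size on a log-log scale; the variable names are illustrative.

```python
import math

# Community sizes used in the test and the number of communities of each size.
size_counts = {128: 1, 32: 4, 8: 16}

n_communities = sum(size_counts.values())
n_nodes = sum(size * count for size, count in size_counts.items())

# Slope of log(count) against log(size) between consecutive points.
points = sorted(size_counts.items())
slopes = [
    (math.log(c2) - math.log(c1)) / (math.log(s2) - math.log(s1))
    for (s1, c1), (s2, c2) in zip(points, points[1:])
]
print(n_communities, n_nodes, slopes)
```

Each size class contains the same number of nodes, 128, so the networks have 384 nodes in total and the number of communities is inversely proportional to their size.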

This suggests that our modified algorithm will perform
better in real networks, where the size of communities is
highly heterogeneous and the community structure is fuzzy.
To check this, we also performed tests on some real
networks. Table I shows the comparison of our modified
algorithm with Newman's original formulation and, where
possible, with the extremal optimisation algorithm. We
looked at the network of jazz bands, with nodes
representing the bands and links between bands
representing at least one musician who played in both
[26]; the e-mail network of University Rovira i Virgili
[23], where e-mail addresses are connected by exchanging
messages; and the network of users of the Pretty Good
Privacy (PGP) algorithm for secure information
transactions [38]. These are medium-sized networks and are
still tractable with the Extremal Optimisation (EO)
algorithm [20], which has a larger running time, scaling
as $O(n^2 \log n)$. In these networks, the EO algorithm
clearly performs best out of the three, which is no
surprise since it explores much more of the partition
space than either of the others. It is, however,
impractical to use in very large networks due to its
running time. In large networks such as the co-authorship
network of the arXiv preprint database [39], the network
of web pages within the nd.edu domain [40], or the actor
network [41], our algorithm is still able to run in a
reasonable time. It improves on the results of the NF
algorithm, finding partitions up to 16% better in terms of
$Q$, with no tradeoff in speed.

FIG. 4: (Colour online) Difference in performance between
the NF algorithm and our modification: (a) proportion of
improvement in $Q$, (b) proportion of improvement in $I(A,B)$.
Our modified algorithm outperforms the NF algorithm in all
parts of parameter space, but the difference is most
pronounced for high values of $P_e$ and low values of $F$,
i.e. where the communities are fuzzy.

TABLE I: Optimal modularity values obtained by the Extremal
Optimisation algorithm, Q_EO, the NF algorithm, Q_NF, and
the modification presented here, Q_M.

Network       Size      Q_EO      Q_NF      Q_M
Zachary       34        0.4188    0.381     0.4087
Jazz bands    198       0.4452    0.4389    0.4409
E-mail        1144      0.5738    0.4796    0.5569
PGP           10680     0.8459    0.7329    0.7462
arXiv         44337     N/A       0.7165    0.7606
WWW           325729    N/A       0.9269    0.9403
Actor         374511    N/A       0.6829    0.7194
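The size of the improvement can be read off Table I. The sketch below takes the relative improvement to be $(Q_M - Q_{NF})/Q_{NF}$, which is an assumption about how the quoted percentage is defined; the values are copied from the table.

```python
# (Q_NF, Q_M) pairs copied from Table I.
table = {
    "Zachary": (0.381, 0.4087),
    "Jazz bands": (0.4389, 0.4409),
    "E-mail": (0.4796, 0.5569),
    "PGP": (0.7329, 0.7462),
    "arXiv": (0.7165, 0.7606),
    "WWW": (0.9269, 0.9403),
    "Actor": (0.6829, 0.7194),
}

# Relative improvement of the modification over the NF algorithm (assumed definition).
improvement = {name: (q_m - q_nf) / q_nf for name, (q_nf, q_m) in table.items()}
best = max(improvement, key=improvement.get)

for name, gain in improvement.items():
    print(f"{name:<11}{100 * gain:6.1f}%")
print("largest improvement:", best)
```

With this definition the largest gain is found for the e-mail network, at about 16%, and the modification gives a higher $Q$ than the NF algorithm for every network in the table.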



V. CONCLUSION 

To conclude, in this paper we have proposed a more
realistic benchmark test for community detection
algorithms in complex networks which takes into account
the heterogeneity of community size observed in real
networks. We have also shown that Newman's fast community
detection algorithm tends to favour the creation of large
communities at the expense of smaller ones. We propose a
simple modification of the fast algorithm which ensures
that communities of differing sizes are treated on an
equal footing, thus side-stepping this potential problem.
Upon comparing the sensitivity of our modification to that
of the original algorithm, we saw that they perform almost
identically in ad-hoc networks with communities of equal
size. However, when compared using the proposed benchmark
test, the modification shows a clear improvement in
sensitivity. Therefore, we claim that the heterogeneity in
community size should be considered when evaluating
community detection algorithms.

Furthermore, we have seen that our modified algorithm
gives improved results for all real networks studied. This
improvement is up to 16% in some of the networks studied.
The improvement in results comes at no extra computational
cost, and a reasonable implementation of the algorithm
will run in $O(n \log^2 n)$ time. We recommend the use of
this simple modification for the study of community
structure in very large complex networks.